It is September of 2020, right in the middle of the COVID-19 pandemic and during the California statewide shutdown. You have just been hired by the California state public health agency to track mortality trends for the past 4 years for which all data have currently been aggregated and de-identified (anonymized). You were given properly de-identified data from 2017 - 2020, which contain mortality records for the state of California by gender, leading cause of death, race / ethnicity, and location of death. All data have been taken from the death certificates filed with the state.
You are provided 6 tables total, which have the following variables.
| All Tables | |
|---|---|
| Year | Year of record |
| Month | Month of record |
| County | California county where death was recorded |

| Annual Mortality Table | |
|---|---|
| annualCount | Annual number of deaths matching the specified criteria |
| raceEthnicity | Race and/or ethnicity of the deceased; Latinx is unfortunately treated as a racial category here rather than an ethnic group |

| Specialized Variables | |
|---|---|
| leadingCauseDeath | Leading cause of death per death certificate; see separate table for encoding |
| ageBins | Age groups where the age at death was known |
| ageDeaths | Number of deaths for the age bin specified |
| biologicalSex | Biological sex of decedent |
| sexDeaths | Number of deaths for the sex specified |
| placeDeathOccurred | Location of death, per death certificate |
| placeDeaths | Number of deaths matching specified criteria |
| raceDeaths | Number of deaths matching race / ethnicity criteria |
| totalDeaths | Number of deaths per month, year, & county |
| 3-letter Code | Description | ICD-10 Codes |
|---|---|---|
| HTD | Diseases of the Heart | I00-I09, I11, I13, I20-I51 |
| CAN | Malignant Neoplasms (Cancers) | C00-C97 |
| STK | Cerebrovascular Disease (Stroke) | I60-I69 |
| CLD | Chronic Lower Respiratory Disease (CLRD) | J40-J47 |
| INJ | Unintentional Injuries | V01-X59, Y85-Y86 |
| PNF | Pneumonia and Influenza | J09-J18 |
| DIA | Diabetes Mellitus | E10-E14 |
| ALZ | Alzheimer’s Disease | G30 |
| LIV | Chronic Liver Disease and Cirrhosis | K70, K73-K74 |
| SUI | Intentional Self Harm (Suicide) | U03, X60-X84, Y87.0 |
| HYP | Essential Hypertension and Hypertensive Renal Disease | I10, I12, I15 |
| HOM | Homicide | U01-U02, X85-Y09, Y87.1 |
| NEP | Nephritis, Nephrotic Syndrome and Nephrosis | N00-N07, N17-N19, N25-N27 |
| OTH | All Other Causes of Death | All codes not listed above |
Unlike our last project, your data engineering team (i.e., me) has provided you with tables that are relatively nice and tidy to begin with - so you can dig right into the analysis. Because these are spatial data organized by California county, you first decide you will make maps to analyze both spatial (geographic) and temporal (time) trends. This may give you some insight into the downstream analysis you would like to do. Consider this to be part of your Exploratory Data Analysis.
As we discussed in lecture over the last 2 weeks, one of the key types of spatial data files we might use is the shapefile. We are going to practice loading and working with shapefiles using this open source dataset for the state of California. You will have to download these data yourself, as they are too large to put onto GitHub. Begin by downloading the requisite spatial data files. Make sure you put them in their own directory called California_County_Boundaries on your Desktop; otherwise, you will need to adjust the path in the file settings below. Notice that the directory contains six files, three of them with file extensions we discussed last week in lecture:
cnty19_1.cpg
cnty19_1.dbf
cnty19_1.prj
cnty19_1.shp
cnty19_1.shp.xml
cnty19_1.shx

What do each of these six files represent? You will not be able to open them directly on your computer, so you may need to refer back to lecture and do a little digging to figure out the type of data you would expect each of these files to contain.
Your answer here.
We want to first get a sense for annual trends in overall mortality across the state to see what we are working with. Our first objective is to join our overall mortality data to the shapefile to be able to make maps to visualize these trends.
We are going to practice two ways to make maps from shapefiles so that you can see two ways to do it. Also, to give you more for reference, we are roughly following the GIS Basics Chapter of the Epi R Handbook (with our own twist, of course!). This chapter may be helpful for understanding more of the “nuts and bolts” of GIS as well, including definitions and why we might do things certain ways.
The initial processing steps will include: * Import the shapefile * Filter rows to keep only areas of interest * Merge with the other data tables to create maps and analyze trends
We will leverage the read_sf() function that is part of
the sf package. Note that sf actually refers to
"simple features," a standard for representing spatial geometries,
rather than to shapefiles themselves. There is an alternative
method to read in shapefiles called st_read();
read_sf() is an alias for st_read() with some different default arguments.
By default, read_sf() is quiet, whereas
st_read() will print information about the data source by
default. Additionally, read_sf() returns a tibble rather
than a dataframe. So which to choose? It's your preference, in this
case. We're going to use read_sf(), but we could use
st_read() if we wanted.
cali_sf <- read_sf("~/Desktop/California_County_Boundaries/cnty19_1.shp")
## Alternatively, we can read in the shapefile using the 'st_read()' function
# cali_st <- st_read("~/Desktop/California_County_Boundaries/cnty19_1.shp")
Next, let’s take a look at the geometry of the shapefile that we loaded.
cali_sf$geometry
## Geometry set for 69 features
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -13849230 ymin: 3833650 xmax: -12704980 ymax: 5162403
## Projected CRS: WGS 84 / Pseudo-Mercator
## First 5 geometries:
## MULTIPOLYGON (((-13611173 4566018, -13611102 45...
## MULTIPOLYGON (((-13312327 4680816, -13314595 46...
## MULTIPOLYGON (((-13366406 4679184, -13366429 46...
## MULTIPOLYGON (((-13478187 4807593, -13478186 48...
## MULTIPOLYGON (((-13360333 4641183, -13361541 46...
Notice that we have a polygon geometry in our shapefile. If you were to investigate further, you would find that currently there are no other geometries (e.g., lines or points) in this shapefile. We could add them and export the shapefile, if we ever needed to do so, like to share with someone else.
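If you would like to inspect these properties directly, or export a (possibly modified) copy of the shapefile, the sf package provides helpers for both. A minimal sketch, assuming cali_sf has been loaded as above (the output path is illustrative):

```r
library(sf)

## Print the coordinate reference system details, including the EPSG code
st_crs(cali_sf)

## Print the bounding box (xmin, ymin, xmax, ymax)
st_bbox(cali_sf)

## Tally the geometry types present (here, all MULTIPOLYGON)
table(st_geometry_type(cali_sf))

## Export a copy of the shapefile, e.g., to share with someone else
## (the path is illustrative; delete_layer overwrites any existing copy)
st_write(cali_sf, "~/Desktop/cali_copy.shp", delete_layer = TRUE)
```

These inspection functions are handy whenever you receive an unfamiliar shapefile and want to understand its extent and projection before mapping.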
What is the bounding box that comes with our shapefile and what does it tell us?
Your answer here.
What is the projection? What does it tell us and why does projection matter?
Your answer here.
As you are already aware, in R there are often many ways to
accomplish the same goal. The same is true for mapping. In the Mini-Demo
for this module, we explored some of the "quick" mapping one can do for
EDA using the ggplot2 and maps packages. We
are going to show you how to use the tmap package now.
tmap follows the same layered grammar-of-graphics philosophy
as ggplot2, and it has some more bells & whistles for maps
than ggplot2 alone.
With tmap, you always start with
tm_shape(). This is equivalent to the ggplot()
function you use to start a ggplot2 map or graph. Note that
we are using the cali_sf tibble that we imported from a
shapefile.
Note that if you’d like to see the Vignette for the tmap
package, you can find it here.
## Start our map in tmap using tm_shape(); we are using the cali_sf tibble we imported
tm_shape(cali_sf) +
## Give it a default color fill
tm_fill() +
## Give it default borders
tm_borders() +
## Let's give it a title and specify its placement
tm_layout(
"California",
inner.margins = c(0, 0, 0.1, 0),
title.size = 0.8)
But what if we did just want to use ggplot2 again? Will
that read a shapefile? Yes! So, let’s try it. Notice that we are using
the geom_sf() function instead of the
geom_polygon() function we used in the Mini-Demo.
## Initiate the plot and tell it the dataset
ggplot(cali_sf) +
## Use the geom_sf function to tell it how to make the map
geom_sf() +
## Set the theme to classic
theme_classic() +
## Let's give it a title
ggtitle("California")
Now, what if we only wanted to visualize a few counties? Let’s look
at the three northernmost counties that border Oregon using the
tmap method. Notice that the county names are accessible
through cali_sf$COUNTY_NAM. We can just use the
filter() function from dplyr to be able to
grab the three northernmost counties we want by name.
## Start from the California spatial tibble
cali_sf %>%
## Filter to just the 3 counties that border Oregon
filter(COUNTY_NAM == "Del Norte" | COUNTY_NAM == "Siskiyou" | COUNTY_NAM == "Modoc") %>%
## Start our map in tmap using tm_shape()
tm_shape() +
## Give it a default color fill
tm_fill() +
## Draw some default borders
tm_borders() +
## Add text to identify the counties by name
tm_text("COUNTY_NAM",
size = 0.8) +
## Let's give it a title:
tm_layout(
"The 3 California counties that border Oregon",
inner.margins = c(0, 0, 0.25, 0),
title.size = 0.8)
Do the same with ggplot2 including adding the text
labels and the title.
# Your code here
As mentioned above, the read_sf() function from the
sf package loads our shapefile as a tibble,
which makes life much easier for us when dealing with shapefiles. It
also means we can easily join our shapefiles with other tibbles /
dataframes, like the ones we've imported of mortality data. Let's
briefly break down what our process is going to look like:
Our Process:

1. Rename the County column so it matches the shapefile
2. Summarize the data by Year so we can visualize the annual trends
3. Join the two tibbles by COUNTY_NAM
4. Turn the joined tibble back into a shapefile using the st_as_sf() function; the join converts the result into a structure that is no longer a shapefile, so you have to back-convert it, otherwise later steps will fail!
We are going to remove the islands, not because we dislike islands, but because they get labeled separately and make our map look cluttered. Further, the few people who reside on the islands of California will still be counted in their onshore county, so we are not ignoring anyone. Feel free to experiment with removing that filter to see what I mean!
Annotate the following code chunk with comments to make sure you understand what we are doing here.
## YOUR COMMENT HERE - make sure to comment on the choice of data
mortality_sf <- totalDeaths %>%
## YOUR COMMENT HERE
rename(COUNTY_NAM = County) %>%
## YOUR COMMENT HERE
group_by(COUNTY_NAM, Year, Month) %>%
## YOUR COMMENT HERE
summarise(totalDeaths = sum(totalDeaths, na.rm = TRUE)) %>%
## YOUR COMMENT HERE - make sure to comment on the relationship parameter
full_join(cali_sf, by = "COUNTY_NAM", relationship = "many-to-many") %>%
## YOUR COMMENT HERE
st_as_sf() %>%
## YOUR COMMENT HERE
filter(is.na(ISLAND))
What does st_join() do? Could you have executed the join
with st_join() instead of full_join() here?
Why or why not?
Your answer here.
We are asked to make four maps, one for each year, that shows how
annual overall mortality trends might be changing as a result of the
COVID-19 pandemic. Notice that we are using the tmap
package again. Notice also that I am first using a
group_by() and a summarize() to tally the
total deaths per county per year. Lastly, let’s apply a
tm_facet() to Year to visualize the total
deaths by California counties for each of the four years.
## Start from the mortality spatial tibble
mortality_sf %>%
## Group by and then summarize to total the deaths by County and Year
group_by(COUNTY_NAM, Year) %>%
summarise(totalDeaths = sum(totalDeaths, na.rm = TRUE)) %>%
## Plot with tmap: tm_shape gets us started
tm_shape() +
## Choropleth: draw the polygons of the counties and fill with color per the total deaths
tm_polygons(col = "totalDeaths",
palette=get_brewer_pal(palette="BuPu", n=10, plot=FALSE),
breaks = c(0, 50, 500, 5000, 25000, 50000, 80000),
alpha = 1,
title = "Total Deaths, All Causes") +
## Draw some borders
tm_borders() +
## Label the counties with their names
tm_text("COUNTY_NAM", size=0.25, col = "darkgrey") +
## Facet by year
tm_facets(by = "Year", ncol = 2) +
## Make a grid layout
tm_layout(legend.position = c("right", "bottom"),
title.position = c('right', 'top'))
Hopefully, you noticed very quickly that the more populous counties in California, e.g., Los Angeles County, have very high mortality counts. Well, duh! If they have more folks, mortality counts will be higher. So, we need to adjust for population size! But how can we get population data if our dataframes do not already include those data? We could look it up or even download it from the US Census Bureau OR… we could learn how to access data even faster (which also means we can more quickly and iteratively access data) through the Census Bureau's API.
If you have not yet worked with an API, never fear! That's why we're doing it here. An API (Application Programming Interface) is an intermediary that allows you to access data or other information, often from a database (but it doesn't have to be), in JSON (JavaScript Object Notation) or another format. By connecting to the Census Bureau's API, we will be able to download instantaneously just the data we want, which in our case will be population sizes and some basic demographics about the breakdown of population by sex and by individuals over the age of 65 years.
Because the US Census API uses JSON formatted files, we will be using
the jsonlite package to convert the JSON files to a
dataframe. But first, you need to access the API!
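Before we do, here is a tiny taste of what jsonlite does for us: fromJSON() will parse a JSON array of records straight into a dataframe. This toy string (with made-up values) stands in for a real API response:

```r
library(jsonlite)

## A toy JSON payload standing in for an API response (values are made up)
payload <- '[{"county":"Alpine","pop":1204},{"county":"Sierra","pop":3236}]'

## fromJSON() simplifies a JSON array of objects into a dataframe,
## with one row per record and one column per field
df <- fromJSON(payload)
df
```

This simplification to a dataframe is exactly what we will rely on when the Census API hands us JSON.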
In order to access the Census API, you will need your own Census API key. I could give you mine, but it's generally not good practice to share API keys unless you trust the person. Why? (1) If someone misuses your API key, you could get banned from accessing the API. (2) Especially for government or non-profit agencies, it can be a good idea to have your own key so that they can keep track of how many users they have. Not all API access is free, but for government, non-profit, or institutionally held data, access is usually free, although you may have different hoops to jump through depending on how secure they want to keep the data (e.g., to protect survey respondents).
You will sign up for your API census key here. You should be able to obtain your key very quickly, but if you run into trouble then I will give you the data I have downloaded through my API key so that you can still keep going with the project.
Once you have your key, fill it into this code chunk here:
## Set your API key (put it between the quotation marks):
key <- ""
We are going to use the 2020 Census since that is the last year of data we have for the mortality dataset. The U.S. Census maintains numerous tables you can use for data. We will be using the Decennial Census: Demographic Profile. You can find the variable explanations for this table of the 2020 Census data here. Below is a custom function I have written that will allow us to extract any of the variables we want from this table as long as you follow the usage.
NOTE: Make sure to change include=FALSE
to TRUE in the first line (the {r} setup line)
of the chunk, or you will get an error upon knitting.
Did you notice how quickly it accessed the information? That is one of the strengths of APIs. You also get to customize the tables to only grab what you want, and you can imagine how you might iterate over multiple years or other structures using a function similar to what I coded for us above. APIs are wonderful to become familiar with as a data scientist; I highly recommend practicing more with APIs as you move toward the job market, as it is a highly sought-after skill!
Your task is to join the census data to the shapefile of overall
mortality we’ve already made. We want a tibble that contains all of the
data from the census dataframe and all of the data from
mortality_sf. I called mine mortality_sf2 but
you may call yours anything you’d like.
Hint 1: The easiest way is to use the FIPS data to join.
Hint 2: Don’t worry about calculating % mortality yet.
Hint 3: If you’re having problems, check your data types!
# Your code here.
Note: If you’re unable to do this for any reason, you can pick back up here. You will need to download the data from Google Drive, if needed, as the shapefile is too large for GitHub yet again. I have placed mine on my Desktop, but you can put it anywhere you like and update the directory path. Just make sure not to put it in your GitHub repo!
load("~/Desktop/mortality_sf2.Rdata")
Now, we were all so excited, we decided to calculate the annual
population-adjusted % mortality per county per year so that we could
recreate the map we tried to make above. This is what we did… (notice
that, this time, I set the color scale for the choropleth to equal the
five-number summary from fivenum() so that
we could more easily see the median versus the quartiles.)
## [1] "The five-number summary of annual % total mortality is: "
## [1] "Min 0" "Lower Quartile 0.494" "Median 0.648"
## [4] "Upper Quartile 0.778" "Max 1.392"
Now let’s plot it:
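The chunk that produced this map is hidden, but a sketch of how such a choropleth could be drawn with the fivenum() values as color breaks might look like the following. The percMortality calculation and the totalN column are assumptions about the hidden code; adjust to match your own objects:

```r
## Hypothetical annual % mortality per county; assumes mortality_sf2
## carries totalDeaths and a census population column called totalN
mortalityAnnual <- mortality_sf2 %>%
  group_by(COUNTY_NAM, Year) %>%
  summarise(percMortality = 100 * sum(totalDeaths, na.rm = TRUE) / mean(totalN, na.rm = TRUE))

## Use the five-number summary as the choropleth color breaks
summ <- fivenum(mortalityAnnual$percMortality)

mortalityAnnual %>%
  tm_shape() +
  tm_polygons(col = "percMortality",
              palette = get_brewer_pal(palette = "BuPu", n = 4, plot = FALSE),
              breaks = summ,
              title = "% Total Mortality") +
  tm_text("COUNTY_NAM", size = 0.25, col = "darkgrey") +
  tm_facets(by = "Year", ncol = 2)
```

Using the quartiles as breaks puts roughly a quarter of the counties in each color bin, which makes the median stand out visually.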
What patterns do you observe now, if any? Do you think this is a true pattern? Aren’t we being told that deaths are going up due to COVID-19??
Your answer here.
Ohhhh… that's right. The pattern observed is simply a by-product of an incomplete record from 2020, isn't it?! We only have 8 months of observations. You realize at this point you were so excited to jump into map-making that you missed something in how you were doing your calculations and/or planning your maps.
## Remove the extra characters from the name of the county
census <- census %>%
mutate(NAME = gsub(" County, California", "", NAME)) %>%
rename(County = NAME)
## Join with the total deaths
totalDeaths <- totalDeaths %>%
full_join(census, by = "County")
Let's now look at annual and monthly trends in total numbers of deaths.
Why did this happen and how could we fix it? (There is more than one correct answer possible.) Could we have made our maps this way and had them be accurate if we'd had a full 12 months of data for each year?
Your answer here.
Although there are several ways we could tackle this, for the map I am going to choose to take an average of monthly % mortality, so that it no longer matters whether we have 8 or 12 months of data available for any given year.
## Calculate the monthly percent mortality
mortalityMonthly_sf2 <- mortality_sf2 %>%
group_by(COUNTY_NAM, Year, Month) %>%
summarize(percMortality = 100*(sum(totalDeaths, na.rm = TRUE)/mean(totalN))) %>%
## Note that I have to divide by the mean of the N from the Census or I'd get an error
ungroup()
## Find the five number summary
summ <- fivenum(mortalityMonthly_sf2$percMortality)
print("The five-number summary of % mortality is:")
## [1] "The five-number summary of % mortality is:"
summ
## [1] 0.00000000 0.04780236 0.05833171 0.07265292 0.16220308
Now try a different way to more accurately portray the %
mortality. (Hint: If you need an idea to get you
started, ask yourself whether you could filter the mortality_sf2
tibble in some way to restrict what is plotted.) What do you observe,
and how do the results compare to the map I chose to make?
# Your code here
Your answer here.
Your stakeholders have asked if you can quickly give them some idea of what to expect mortality will look like for the remaining months of 2020. Now, although this is not going to be the most robust method for time series projection available (a caveat!), you immediately tell me, “Oh, I could do a regression.” So, you do.
mod <- lm(percMortality ~ COUNTY_NAM + Year + Month, data = mortalityMonthly_sf2)
# plot(mod)
## Make the new data:
## Get the names of the counties; repeat four times for the 4 months
COUNTY_NAM <- rep(unique(mortalityMonthly_sf2$COUNTY_NAM), 4)
## Make a set of months for each of the 58 counties for Sept, Oct, Nov, Dec
Month <- c(rep(9, 58),
rep(10, 58),
rep(11, 58),
rep(12, 58))
Year <- rep(2020, length(Month))
newdata <- cbind.data.frame(COUNTY_NAM, Year, Month)
## Attach the predictions to the new dataframe as a column
newdata$propDeaths <- predict(mod, newdata = newdata)
newdata$isPredicted <- "1"
Let’s finish out the year of 2020 with predictions (shown in bubble points):
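The plotting code itself is in a hidden chunk, but one way the observed monthly series could be drawn with the September-December predictions overlaid as bubble points is sketched below. The stacking and renaming steps are assumptions about the hidden code (observed values are relabeled propDeaths to match newdata):

```r
library(dplyr)
library(ggplot2)

## Flag the observed monthly values, matching the columns of newdata
observed <- mortalityMonthly_sf2 %>%
  sf::st_drop_geometry() %>%
  transmute(COUNTY_NAM, Year, Month,
            propDeaths = percMortality,
            isPredicted = "0")

## Stack observed 2020 values with the Sept-Dec predictions and plot
bind_rows(observed, newdata) %>%
  filter(Year == 2020) %>%
  group_by(Month, isPredicted) %>%
  summarise(propDeaths = mean(propDeaths, na.rm = TRUE), .groups = "drop") %>%
  ggplot(aes(x = Month, y = propDeaths)) +
  geom_line(alpha = 0.6) +
  ## Bubble points mark the predicted months (Sept-Dec)
  geom_point(data = function(d) filter(d, isPredicted == "1"),
             size = 4, alpha = 0.7) +
  scale_x_continuous(n.breaks = 12) +
  labs(title = "Statewide mean monthly % mortality, 2020",
       subtitle = "Sept-Dec shown as model predictions (bubbles)",
       y = "% Mortality") +
  theme_minimal()
```

Passing a function as the data argument to geom_point() subsets the already-summarized data, so the predicted months get their own layer without a second pipeline.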
Do you think this regression prediction is robust? Why or why not?
(Hint: Check your assumptions with
plot(mod)!!)
# Your code here.
Your answer here.
By now, you have probably realized that the regression we did is neither adequate nor to be trusted. Your task is to improve the prediction. Lay out a brief analysis plan, defend your choice of machine learning algorithm, and attempt to update at least the map if not also the line graph. Be sure to include an RMSE or some other metric to show how your prediction is performing.
Wait, how? Ah, in reality the pandemic is now over! You can find the actual number of deaths, per county, here.
Hint: You can read this file in as a dataframe using JSON API as follows!
deaths2020 <- fromJSON("https://data.chhs.ca.gov/datastore/odata3.0/579cc04a-52d6-4c4c-b2df-ad901c9049b7?$top=5&$format=json")
Remember to filter for the proper condition, Year, and Months to get the data you need for your testing set.
Your analysis plan here.
# Your code here. You can add more code chunks as needed to do the work asked!
Your conclusions here.
If this open-ended analysis feels overwhelming, just ask for more hints to get you started!
California was one of the strictest states when it came to shutdown measures during the COVID-19 pandemic, with both statewide enforcement and a duration longer than many other states'. California declared a state of emergency on March 4, 2020, with a mandatory statewide stay-at-home order issued on March 19, 2020. This statewide order didn't end until January 25, 2021, and it wasn't until April 6, 2021 that the state announced plans to fully reopen the economy by June 15, 2021.
Your stakeholder is actually not as interested in deaths from COVID-19 per se, but more interested in how the mandatory stay-at-home order seems to be affecting mortality due to certain conditions. For example, are people less likely to die from accident or injury since they're under stay-at-home orders?
Because of the mandatory COVID-19 shutdown in California, which has led to generally reduced movement throughout California, your stakeholder would like to know if the number of accidental deaths declined enough during COVID-19 shutdown to generate a signal.
HYPOTHESIS: I hypothesize that because of intense,
mandatory stay-at-home measures, deaths attributed to accidents and
injuries (INJ) declined during the COVID-19 shutdown during
the months of 2020 in California.
So, let’s test it! We will start, as with all good analysis plans, with some more Exploratory Data Analysis.
INJ Mortality

The leading causes of death are located in the sexDeaths
table because that was how the data were provided to us. They are
separated by sex, so we can check whether there are sex differences
we should worry about before proceeding. We do need to join some of our
tables and calculate some additional metrics as well, including:
Proportion of deaths from accident or injury out of all deaths (no longer adjusting for population size)
Ratio of adult females to males in the population
Ratio of females to males in the population among those over 65 years of age
Proportion of adults over the age of 65 (retirement age) in the population
We have opted to adjust per total deaths, rather than population size, so that we can better examine the fraction of deaths due to particular causes regardless of the underlying size of the population. However, we do still need to account for the ratio of females to males in the population. It might also be useful, now or in future analyses, to have access to the ratio of females to males just among those over the age of 65, as well as the proportion of retirees (65+ years old) in the population.
## Join the sex deaths table with the census data; then filter for INJ
injDeaths <- sexDeaths %>%
full_join(census, by = "County") %>%
filter(leadingCauseDeath == "INJ") %>%
## Now full join with the first 4 columns of the totalDeaths table to get total deaths also
full_join(totalDeaths[ , 1:4], by = c("County", "Year", "Month")) %>%
## rename the sexDeaths column to something more appropriate
rename(injuryDeaths = sexDeaths) %>%
## Calculate the proportion of all deaths that are from injury RELATIVE TO ALL DEATHS (not population)
mutate(deathsInjuriesProportion = injuryDeaths/totalDeaths,
femaleMaleRatio = femalesN/malesN, ## The ratio of females to males in the population
femaleMaleRatio65 = `females65+`/`males65+`, ## The ratio of 65+ females to males in the population
retireeRatio = over65/totalN, ## Proportion of the population that is of retirement age (65+)
Date = as.Date(paste(Year, Month, "1", sep = "-"), format = "%Y-%m-%d"), ## Convert the date to a date format for plotting
deathsInjuriesProportion = ifelse(deathsInjuriesProportion == "NaN", 0, deathsInjuriesProportion)) ## Illegal div by 0 turned to NaN; turn back to 0
Is it fair to model the values without accounting for an effect of sex, do you think? Why or why not?
Your answer here.
What if we calculate the proportion of deaths for each sex, instead? Wouldn't that give us a more accurate portrait of any sex-level differences that might exist?
injDeaths %>%
group_by(Month, biologicalSex, Year) %>%
summarise(deathsInjuriesProportion = mean(deathsInjuriesProportion, na.rm = TRUE)) %>%
ggplot(aes(x = Month,
y = deathsInjuriesProportion,
color = as.factor(Year),
linetype = as.factor(biologicalSex))) +
geom_line(alpha = 0.7,
size = 1.1) +
scale_x_continuous(n.breaks = 10) +
scale_color_manual(values = c("#d45731", "cornflowerblue", "#925ebf", "#86a607")) +
scale_linetype_manual(values=c("dotdash", "solid")) +
labs(title = "Proportion all deaths due to accident or injury, by sex",
y = "Proportion of Deaths Due to Accident or Injury",
subtitle = "Cumulative for all California counties, Jan 2017 - Aug 2020",
color = "Year",
linetype = "Sex of Decedent") +
theme_minimal()
## `summarise()` has grouped output by 'Month', 'biologicalSex'. You can override
## using the `.groups` argument.
What do you notice about the trend? Is there anything compelling here potentially? Can you find any possible explanation for what happened in May (month 5) of 2020?
Your answer here.
With the previous model, I did a subpar linear regression in the hopes that you would improve upon my predictive model. This time, since I am going to have you focus on the exploration side of things, I am going to attempt to make a slightly better model – but please be aware that this is NOT as robust as I would like it to be to return to stakeholders. This is more for demonstration purposes so that we can get to our true goal: mapping predictions!
I know from the earlier model diagnostic plots that the linear regression assumptions were not upheld. We have several options we could explore here to make a better model, including performing a Box-Cox transformation, considering a LASSO regression (especially since there could be high collinearity among some counties, especially those that are more urban vs. rural), or even switching to a regression algorithm that is agnostic to the underlying assumptions of Ordinary Least Squares, like a Random Forest. Here, the latter is exactly what we will implement - and we will add another ensemble method, Gradient Boosting Machine (GBM) for comparison.
Before we move forward, let’s also clarify our question a bit.
What effect has County, Year,
Month, stayAtHome, biologicalSex,
femaleMaleRatio, totalN, and
retireeRatio had on the proportion of all deaths that have
occurred due to accident or injury?
In other words, does the proportion of accident / injury-based deaths differ by the county where the decedent lived, the month in which the death occurred, whether it is during the COVID-19 stay-at-home order or not, the biological sex of the decedent, the sex ratio of adult females to males in the county, the population size of the county, and the ratio of older adults over 65 years relative to non-retirees?
Wait, why am I throwing retireeRatio in there? Ahh, we
can think of reasons why accidental deaths might change over the
duration of one’s life. Perhaps the relative proportion is unchanged and
only the cause of the accidental death might change (i.e., more
stunt-based deaths when younger, more falls when older) OR the actual
proportion might decline with age because people are generally less
active and move / drive around less.
There really isn’t much we need to do here, but I will go ahead and strip the dataset down to just the columns we are going to focus on in the analysis as well as drop any missing values. I am not going to do any imputation of missing values this time because there are some counties for which we just have had no reporting. We don’t want to impute for a county that has no data at all.
injDeaths_cleaned <- injDeaths %>%
dplyr::select(County, Year, Month, stayAtHome, biologicalSex, femaleMaleRatio, totalN, retireeRatio, deathsInjuriesProportion) %>%
drop_na()
Before we do any encoding or transformations, let’s do our test-train split as we’ve already discussed.
Remember that little function we annotated in the last project, the one
called calcSplitRatio? We are going to load it in and use
it again to help us find our optimum split ratio! I have put
it into an R script, calcSplitRatio.R. Then, we
need to pass it our dataframe, injDeaths_cleaned, as well as the
number of parameters we expect to test. Here the number of parameters is
the 58 counties in the dataset plus the 7 other features
(parameters here) we plan to use (Year, Month,
stayAtHome, biologicalSex,
femaleMaleRatio, totalN, and
retireeRatio).
source("calcSplitRatio.R")
train_prop <- calcSplitRatio(df = injDeaths_cleaned, p = (length(unique(injDeaths_cleaned$County)) + 7)) ## 58 counties + 7 more parameters
## [1] "The ideal split ratio is 0.88:0.12 training:testing"
Not too surprisingly, since this is a smaller dataset, it is recommending a larger training set than we've seen before.
ind <- createDataPartition(injDeaths_cleaned$deathsInjuriesProportion,
p = train_prop,
list = FALSE)
train <- injDeaths_cleaned[ind, ]
test <- injDeaths_cleaned[-c(ind), ]
Next, we need to do our encoding. We are going to do one-hot encoding
for all of the categorical variables: County,
stayAtHome, and biologicalSex. We will
manually make the one-hot encodings for stayAtHome and
for biologicalSex, since there are only two conditions in
each of those columns; the dummyVars() function would try to
turn each of them into two columns, which is both unnecessary AND increases our
dimensionality! But we will leverage dummyVars() from the
caret package for the County variable to make
our life so much easier.
Note: I have encoded males as 1 because
it is the condition in which we expect a higher proportion of
injury-related deaths. Thus, females have been made the reference
condition. Additionally, I encoded the stayAtHome condition
of None as 0, making it the reference condition.
encodeTrain <- function() {
## Make one-hot encoded variables manually
train <- train %>%
dplyr::mutate(stayAtHome = ifelse(stayAtHome == "None", 0, 1),
biologicalSex = ifelse(biologicalSex == "Male", 1, 0))
## Now make the one-hot encoded County
county <- dummyVars(" ~ County", data = train)
trainReady <- predict(county, newdata = train) %>% ## Do the encoding
data.frame() %>% ## Make a dataframe
cbind(train) %>% ## Add back to the train set
select(-County) ## Drop the original county
return(trainReady)
}
trainReady <- encodeTrain()
encodeTest <- function() {
## Make one-hot encoded variables manually
test <- test %>%
mutate(stayAtHome = ifelse(stayAtHome == "None", 0, 1),
biologicalSex = ifelse(biologicalSex == "Male", 1, 0))
county <- dummyVars(" ~ County", data = test)
testReady <- predict(county, newdata = test) %>% ## Do the encoding
data.frame() %>% ## Make a dataframe
cbind(test) %>% ## Add back to the test set
select(-County) ## Drop the original county
return(testReady)
}
testReady <- encodeTest()
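If you prefer base R, model.matrix() can produce the same kind of county indicators that dummyVars() creates. A quick illustration on a toy dataframe of my own (not our actual data):

```r
## Base-R alternative to caret::dummyVars() for one-hot encoding,
## shown on a toy dataframe (not our actual data).
toy <- data.frame(County = c("Alameda", "Butte", "Alameda", "Colusa"),
                  deaths = c(10, 3, 12, 1))

## "- 1" drops the intercept so EVERY county gets its own 0/1 column
onehot <- model.matrix(~ County - 1, data = toy)
colnames(onehot) ## "CountyAlameda" "CountyButte" "CountyColusa"
```

Each row has exactly one 1 among the county columns, which is what makes this a one-hot encoding.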
Now, there’s one little problem… that’s a BIG
problem. Not all of the counties have columns in the
testReady dataset. Why not? Well, if a county
was missing from the County column of the test dataset,
then the column would not be made by dummyVars()!
So, we will need to make a blank column full of zeros for any columns
missing from testReady. There is a handy function,
setdiff(), that allows us to compare two vectors (here, the
column names of the two dataframes) and returns what is present in the
first vector but missing from the second.
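As a quick illustration of how setdiff() behaves, on toy vectors of my own rather than our actual column names:

```r
## setdiff() returns elements of the first vector absent from the second.
## Note that it is NOT symmetric: the order of the arguments matters.
train_cols <- c("CountyAlameda", "CountyButte", "CountyColusa", "stayAtHome")
test_cols  <- c("CountyAlameda", "CountyColusa", "stayAtHome")

setdiff(train_cols, test_cols) ## "CountyButte" is missing from test
setdiff(test_cols, train_cols) ## character(0): nothing missing from train
```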
fixMissingColumns <- function() {
  ## Store the names of the missing columns in a vector
  columns_missing <- setdiff(names(trainReady), names(testReady))
  ## Make the new columns and populate them with zeros
  testReady[, columns_missing] <- 0
  print(paste0("Column comparison: Are test and train the same now? ",
               ncol(testReady) == ncol(trainReady)))
  return(testReady)
}
## Encode train again
trainReady <- encodeTrain()
testReady <- fixMissingColumns()
## [1] "Column comparison: Are test and train the same now? TRUE"
We can see that the testReady dataset now has the same
number of columns as the trainReady dataset. Yay! Big
little problem solved.
This time, we will take advantage of the preProcess()
function in caret as an alternative method; last time, we
used scale(). Notice that preProcess()
computes the centering and scaling factors from the training data
only and then applies those same factors to the test set,
so no information from the test set leaks into training.
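Under the hood, this amounts to something like the following base-R sketch (toy numbers of my own, not caret's actual implementation):

```r
## Toy illustration: centering/scaling factors come from TRAIN only,
## then the SAME factors are applied to TEST.
train_x <- c(2, 4, 6, 8)
test_x  <- c(3, 5)

mu    <- mean(train_x) ## 5
sigma <- sd(train_x)   ## ~2.58

train_scaled <- (train_x - mu) / sigma
test_scaled  <- (test_x - mu) / sigma ## note: TRAIN's mu and sigma, not test's

round(test_scaled, 2) ## -0.77  0.00
```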
centerScale <- preProcess(trainReady, method = c("center", "scale"))
trainReady <- predict(centerScale, trainReady)
testReady <- predict(centerScale, testReady)
## Set the control for 10-fold CV with a random hyperparameter search
ctrl <- trainControl(method = "cv",
                     number = 10,
                     search = "random",
                     verboseIter = FALSE)
## Run a RF with a random search grid
## (random search is requested via search = "random" in trainControl(),
##  not as an argument to train())
RF <- train(deathsInjuriesProportion ~ .,
            data = trainReady,
            method = "ranger",
            trControl = ctrl,
            importance = 'impurity')
## Make the predictions against the test dataset
predRF <- predict(RF,
newdata = testReady)
## Look at the R-squared, RMSE, and MAE
round(postResample(predRF, test$deathsInjuriesProportion), 4)
## RMSE Rsquared MAE
## 0.8845 0.9352 0.7775
We have not used a gradient boosting machine (GBM) yet, so I want to take a moment to orient you to it if you are not familiar with it. Just as with Random Forest, GBMs can be classification or regression models, and like RF, gradient boosting is an ensemble method. The two differ in how the ensemble is built, however: RF grows its trees independently on bootstrap samples and averages their predictions, whereas in boosting methods, new models are added to the ensemble sequentially. At each iteration, a new weak learner model is trained on the error of the ensemble learned thus far. Because each new weak learner either provides a more accurate estimate of the target or chases noise in the training dataset, GBMs tend to either really outperform other methods OR become severely overfitted, depending on how carefully that sequential addition of models is controlled.
Each weak learner that is added has maximum correlation with the negative gradient of the loss function, which is defined for the entire ensemble. The loss function can be chosen fairly freely, but it is best to match it to the task; for regression, squared-error loss is the standard choice. You even have the option to define your own task-specific loss, if needed; you would just need to know what error you are attempting to optimize! Further, GBMs are extremely customizable, and one can experiment with different loss functions as well.
The weak learners can also be chosen in addition to the loss function, meaning that you can design your GBM to use linear regression, Ridge regression, decision trees, or even your own weak learner!
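To make the sequential idea concrete, here is a minimal from-scratch sketch of gradient boosting for squared-error loss, using single-split stumps as the weak learners. The data are toy, and the function names are my own, not from any package.

```r
## Toy gradient boosting for squared-error loss.
## For squared error, the negative gradient is simply the residual,
## so each iteration fits a weak learner to the current residuals.
set.seed(1)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.2)

## Weak learner: the best single-split "stump" on x
fitStump <- function(x, r) {
  best <- NULL
  best_sse <- Inf
  for (s in quantile(x, probs = seq(0.1, 0.9, by = 0.1))) {
    pred <- ifelse(x <= s, mean(r[x <= s]), mean(r[x > s]))
    sse <- sum((r - pred)^2)
    if (sse < best_sse) {
      best_sse <- sse
      best <- list(split = s, left = mean(r[x <= s]), right = mean(r[x > s]))
    }
  }
  best
}
predictStump <- function(m, x) ifelse(x <= m$split, m$left, m$right)

## The boosting loop: add shrunken weak learners sequentially
boostSketch <- function(x, y, n_trees = 100, shrinkage = 0.1) {
  f <- rep(mean(y), length(y)) ## start from the mean
  for (i in seq_len(n_trees)) {
    r <- y - f                              ## negative gradient (residuals)
    m <- fitStump(x, r)                     ## fit weak learner to residuals
    f <- f + shrinkage * predictStump(m, x) ## small, sequential update
  }
  f
}

rmse <- function(a, b) sqrt(mean((a - b)^2))
fitted_boost <- boostSketch(x, y)
rmse(fitted_boost, y) < rmse(mean(y), y) ## boosting beats the mean baseline
```

Swapping the residual step for a different gradient is how other losses plug into the same loop.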
Look for a reliable resource that explains how the GBM algorithm works and do your best to attempt to explain what is happening at each iteration. Make sure to give a link or citation.
Your answer here.
What hyperparameters can one tune in a GBM? Hint: Make sure to explain at least 4!
Your answer here.
Okay, let’s execute it!
gbmGrid <- expand.grid(interaction.depth = c(1, 5, 9, 15),
n.trees = (1:30)*50,
shrinkage = c(0.05, 0.1, 0.15),
n.minobsinnode = c(20, 30, 50))
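Before running this, note how large that grid is: every candidate combination is refit in every CV fold. A quick check, re-creating the same grid:

```r
## The tuning grid is the Cartesian product of all hyperparameter values:
## 4 depths x 30 tree counts x 3 shrinkages x 3 minimum node sizes
gbmGrid <- expand.grid(interaction.depth = c(1, 5, 9, 15),
                       n.trees = (1:30) * 50,
                       shrinkage = c(0.05, 0.1, 0.15),
                       n.minobsinnode = c(20, 30, 50))
nrow(gbmGrid) ## 1080 combinations, each refit in all 10 CV folds
```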
## Run a GBM with the pre-defined search grid
GBM <- train(deathsInjuriesProportion ~ .,
             data = trainReady,
             distribution = "gaussian",
             method = "gbm",
             trControl = ctrl,
             tuneGrid = gbmGrid,
             verbose = FALSE)
## Make the predictions against the test dataset
predGBM <- predict(GBM,
newdata = testReady)
## Look at the R-squared, RMSE, and MAE
round(postResample(predGBM, test$deathsInjuriesProportion), 4)
## RMSE Rsquared MAE
## 0.8807 0.8906 0.7711
Which model is superior and why? Do you trust these results?
Your answer here.
Do you trust the variable importance results? Is there any difference in the variable importance between the two models?
Your answer here.
Do you think there is any effect of geography? Why or why not? Would it be worth mapping? (Hint: Does the highest-importance variable suggest that there IS or IS NOT an effect of geography?)
Your answer here.
If we were to proceed with mapping, what recommendations or changes should we implement beforehand? (Try to think about this relative to mapping, predicting, and then mapping the predictions of all mortalities).
Your answer here.
Do we have any indication, thus far, that the pandemic is reducing the number of injury-related mortalities? Why or why not? What might we need to be able to see a stronger variable importance? (Hint: “More time” or “more deaths” are viable options, but there’s another more practical example as well.)
Your answer here.
Originally, I wanted you to test your own condition. I would still love for you to explore these data further on your own; however, to abbreviate this exercise and not overwhelm you with work before your final project, we will instead make only ONE map after we plan our question, hypothesis, and analysis.
Formulate a question based on these data that you would like to test relative to the effect of the stay-at-home order in California. Can you find any articles or reports to support this? (One link or citation required).
Your answer here.
Formulate a hypothesis based on your question.
Your answer here.
Formulate an analysis plan, including the maps you would make and the predictive algorithm you would use to test your question. Please give at least a few very descriptive sentences for full credit.
Your answer here.
Attempt to make a single map, as I did for the injury-related deaths relative to the stay-at-home order. Test your question about the shutdown, visually and with a map.
# Your code here. Add more chunks if needed / desired.
What trends, if any, do you notice? Why do you come to this conclusion?
Your answer here.